Flyteadmin digest comparison should rely on database semantics #6058

Conversation

Contributor

@popojk popojk commented Nov 29, 2024

Tracking issue

Closes #4780

Why are the changes needed?

In the current TaskManager CreateTask code, FlyteAdmin first checks whether a task with the same ID already exists in the database. If it does, FlyteAdmin compares the digest of the task being registered with the digest of the existing task. If no task with the same ID is found, FlyteAdmin proceeds to create the task in the database.

However, this check-then-create approach has a race condition that can prevent the digest comparison from happening for two identical tasks. For example, consider two identical tasks (same ID and same digest), A and B, being registered with FlyteAdmin simultaneously. Both requests can run the existence check before either task has been written, so the digest check is skipped for both. Consequently, one task is created in the database and the other fails with a primary key conflict. (Refer to the diagram below for a better understanding.)

Screenshot 2024-11-29 2:36:48 PM
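The interleaving described above can be replayed deterministically in a few lines of Go. This is only a sketch: `taskTable`, `replayRace`, and `errPrimaryKeyConflict` are hypothetical names for illustration, and a plain map stands in for the database table; none of this is FlyteAdmin code.

```go
package main

import (
	"errors"
	"fmt"
)

// errPrimaryKeyConflict is a hypothetical stand-in for the database's
// unique-constraint violation.
var errPrimaryKeyConflict = errors.New("primary key conflict")

// taskTable is an in-memory stand-in for the tasks table: id -> digest.
type taskTable map[string]string

// exists models the "does a task with this ID already exist?" lookup.
func (t taskTable) exists(id string) bool {
	_, ok := t[id]
	return ok
}

// insert models a row insert guarded by a primary key on id.
func (t taskTable) insert(id, digest string) error {
	if t.exists(id) {
		return errPrimaryKeyConflict
	}
	t[id] = digest
	return nil
}

// replayRace runs the check-then-create flow for two identical requests,
// A and B, in the problematic order: both lookups happen before either insert.
func replayRace(id, digest string) (digestChecked bool, errA, errB error) {
	t := taskTable{}
	aFound := t.exists(id) // request A: not found
	bFound := t.exists(id) // request B: not found either
	// A digest comparison would only run if one of the lookups found a row.
	digestChecked = aFound || bFound
	errA = t.insert(id, digest)
	errB = t.insert(id, digest)
	return digestChecked, errA, errB
}

func main() {
	checked, errA, errB := replayRace("task-1", "digest-a")
	fmt.Printf("digest checked: %v, A: %v, B: %v\n", checked, errA, errB)
}
```

In this replay neither request ever compares digests, and request B surfaces a raw primary key conflict instead of a meaningful "already exists" error.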

What changes were proposed in this pull request?

1. Do the digest check in a transactional way:

The procedure for creating a task should be: 1. create the task -> 2. if the task ID already exists (primary key conflict) -> 3. do the digest check. The pseudocode could look like:

in the transaction:
try:
  create a task with the given primary key
except primary key already exists:
  get the existing entry with the identical primary key
  if digest of existing == new entry's digest -> return `NewTaskExistsIdenticalStructureError`
  else -> return `NewTaskExistsDifferentStructureError`
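For readers who prefer Go to pseudocode, here is a minimal, self-contained sketch of the same insert-first flow. It assumes a hypothetical in-memory `store` in place of the database, and the sentinel errors `ErrIdenticalStructure` and `ErrDifferentStructure` merely stand in for the two FlyteAdmin errors named above; the actual change in `task_manager.go` may be structured differently.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"sync"
)

// Hypothetical sentinel errors for illustration only.
var (
	errAlreadyExists      = errors.New("primary key conflict")
	ErrIdenticalStructure = errors.New("task exists with identical structure")
	ErrDifferentStructure = errors.New("task exists with different structure")
)

// store is an in-memory stand-in for a table with a primary key on the task ID.
type store struct {
	mu   sync.Mutex
	rows map[string][]byte // task ID -> digest
}

// insert fails with errAlreadyExists when the ID is taken, the way a
// unique constraint would reject a duplicate row.
func (s *store) insert(id string, digest []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.rows[id]; ok {
		return errAlreadyExists
	}
	s.rows[id] = digest
	return nil
}

// get returns the stored digest for id, if any.
func (s *store) get(id string) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	d, ok := s.rows[id]
	return d, ok
}

// createTask inserts first and compares digests only after a conflict,
// so the comparison can no longer be skipped by a concurrent registration.
func createTask(s *store, id string, digest []byte) error {
	err := s.insert(id, digest)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return err
	}
	existing, ok := s.get(id)
	if !ok {
		return fmt.Errorf("task %q reported as existing but was not found", id)
	}
	if bytes.Equal(existing, digest) {
		return ErrIdenticalStructure
	}
	return ErrDifferentStructure
}

func main() {
	s := &store{rows: map[string][]byte{}}
	fmt.Println(createTask(s, "t1", []byte("abc"))) // first registration
	fmt.Println(createTask(s, "t1", []byte("abc"))) // same ID, same digest
	fmt.Println(createTask(s, "t1", []byte("xyz"))) // same ID, different digest
}
```

The key design point is that the insert itself acts as the existence check, so the database's uniqueness guarantee decides which request wins rather than a separate lookup.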

This way, the task digest is checked even when 2 identical tasks are registered at the same time. Refer to the diagram below for a better understanding.

Screenshot 2024-11-22 5:28:50 PM

2. Write Task to DB before writing Description in the TaskRepo Create method:

In the current TaskRepo Create method, the task description is created before the task. However, if TaskManager catches a primary key conflict error from the task description creation and then tries to fetch the existing task from the DB for the digest check, a task-not-found error could occur because the task has not yet been created in the DB, which would not make sense to the user. This PR therefore proposes writing the Task to the DB before writing the Description in the TaskRepo Create method.

How was this patch tested?

Set up a simple workflow with 2 tasks

Screenshot 2024-11-29 3:23:27 PM

Write a shell script that requests task registration 10 times at the same time to simulate a high-concurrency situation. Each task is expected to be registered successfully only once; every other response should show AlreadyExists.

Screenshot 2024-11-29 3:31:06 PM

The results show that each task was registered only once, as expected.

Screenshot 2024-11-28 11:55:20 AM Screenshot 2024-11-28 11:55:46 AM
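The shell script itself is only available as a screenshot. As a rough in-process analogue of the same experiment, the Go sketch below fires 10 simultaneous registrations of one task at a hypothetical in-memory `registry` (an assumption for illustration; the real test went through FlyteAdmin and its database) and counts the outcomes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAlreadyExists is a hypothetical stand-in for the AlreadyExists response.
var errAlreadyExists = errors.New("already exists")

// registry is an in-memory stand-in for the tasks table: ID -> digest.
type registry struct {
	mu    sync.Mutex
	tasks map[string]string
}

// register inserts the task atomically and rejects duplicates.
func (r *registry) register(id, digest string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.tasks[id]; ok {
		return errAlreadyExists
	}
	r.tasks[id] = digest
	return nil
}

// registerConcurrently releases n goroutines at once, each registering the
// same task, and reports how many succeeded and how many were duplicates.
func registerConcurrently(r *registry, id, digest string, n int) (created, duplicates int) {
	results := make(chan error, n)
	start := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // wait so that all requests start together
			results <- r.register(id, digest)
		}()
	}
	close(start)
	wg.Wait()
	close(results)
	for err := range results {
		switch {
		case err == nil:
			created++
		case errors.Is(err, errAlreadyExists):
			duplicates++
		}
	}
	return created, duplicates
}

func main() {
	r := &registry{tasks: map[string]string{}}
	created, duplicates := registerConcurrently(r, "project/domain/task/v1", "digest-a", 10)
	fmt.Printf("created=%d duplicates=%d\n", created, duplicates)
}
```

With an atomic insert, exactly one of the 10 requests creates the task and the remaining 9 are reported as duplicates, which mirrors the expectation stated for the shell-script test.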

Then, we wrote a shell script to register 2 groups of tasks with the same ID but different digests at the same time. TaskExistsDifferentStructureError is expected to be shown in the response.

Screenshot 2024-11-29 3:40:48 PM

The error is shown as expected.

Screenshot 2024-11-29 11:51:21 AM

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Summary by Bito

This PR implements a transactional approach to resolve race conditions in FlyteAdmin's task registration process. The changes modify task creation workflow by handling primary key conflicts and performing digest checks. The implementation reorders task and description creation sequence in the database for consistent error handling.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 1

…vent TaskManager CreateTask method Task not found isue

Signed-off-by: Alex Wu <[email protected]>
Signed-off-by: Alex Wu <[email protected]>

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 63.63636% with 8 lines in your changes missing coverage. Please review.

Project coverage is 37.11%. Comparing base (0585fba) to head (28481a0).
Report is 41 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| flyteadmin/pkg/manager/impl/task_manager.go | 60.00% | 6 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6058   +/-   ##
=======================================
  Coverage   37.10%   37.11%           
=======================================
  Files        1318     1318           
  Lines      132326   132337   +11     
=======================================
+ Hits        49099    49112   +13     
+ Misses      78955    78952    -3     
- Partials     4272     4273    +1     

| Flag | Coverage Δ |
|---|---|
| unittests-datacatalog | 51.58% <ø> (ø) |
| unittests-flyteadmin | 54.12% <63.63%> (+0.01%) ⬆️ |
| unittests-flytecopilot | 30.99% <ø> (ø) |
| unittests-flytectl | 62.33% <ø> (+0.04%) ⬆️ |
| unittests-flyteidl | 7.23% <ø> (-0.01%) ⬇️ |
| unittests-flyteplugins | 53.82% <ø> (ø) |
| unittests-flytepropeller | 42.63% <ø> (ø) |
| unittests-flytestdlib | 57.59% <ø> (+0.01%) ⬆️ |


Signed-off-by: Alex Wu <[email protected]>
@popojk popojk force-pushed the Flyteadmin_digest_comparison_should_rely_on_database_semantics branch from e717145 to 28481a0 Compare December 5, 2024 10:07
@@ -30,12 +30,12 @@ func (r *TaskRepo) Create(ctx context.Context, input models.Task, descriptionEnt
}
return nil
}
- tx := r.db.WithContext(ctx).Omit("id").Create(descriptionEntity)
+ tx := r.db.WithContext(ctx).Omit("id").Create(&input)
Contributor

I think this change is technically not necessary since this is all wrapped in a transaction, and if any insert fails then the whole transaction should be rolled back.

Contributor

@katrogan katrogan left a comment

Nice! thank you so much for taking on these changes

for the testing:

> Then, we make a shell script to register 2 groups of tasks with same ID but different digest at the same time.

Did you modify the task definition to force a different task digest? I didn't quite follow from the description.

@eapolinario
Contributor

/review

@flyte-bot
Collaborator

flyte-bot commented Dec 27, 2024

Code Review Agent Run #ec8f77

Actionable Suggestions - 2
  • flyteadmin/pkg/manager/impl/task_manager.go - 1
    • Consider debug level for non-critical errors · Line 115-115
  • flyteadmin/pkg/manager/impl/task_manager_test.go - 1
Review Details
  • Files reviewed - 3 · Commit Range: c3134ec..28481a0
    • flyteadmin/pkg/manager/impl/task_manager.go
    • flyteadmin/pkg/manager/impl/task_manager_test.go
    • flyteadmin/pkg/repositories/gormimpl/task_repo.go
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful


@flyte-bot
Collaborator

Changelist by Bito

This pull request implements the following key changes.

Key Change: Bug Fix - Fix Race Condition in Task Registration

Files Impacted:

  • task_manager.go - Refactored task creation logic to handle concurrent registrations
  • task_repo.go - Modified task creation order to write Task before Description
  • task_manager_test.go - Added test cases for duplicate task registration scenarios

// See if an identical task already exists by checking the error code
flyteErr, ok := err.(errors.FlyteAdminError)
if !ok || flyteErr.Code() != codes.AlreadyExists {
logger.Errorf(ctx, "Failed to create task model with id [%+v] with err %v", request.GetId(), err)
Collaborator

Consider debug level for non-critical errors

Consider using logger.Debugf instead of logger.Errorf for non-critical database errors. The error is already being returned to the caller.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- logger.Errorf(ctx, "Failed to create task model with id [%+v] with err %v", request.GetId(), err)
+ logger.Debugf(ctx, "Failed to create task model with id [%+v] with err %v", request.GetId(), err)

Code Review Run #ec8f77



}, nil
})
mockRepository.TaskRepo().(*repositoryMocks.MockTaskRepo).SetCreateCallback(func(input models.Task, descriptionEntity *models.DescriptionEntity) error {
return adminErrors.NewFlyteAdminErrorf(codes.AlreadyExists, "task already exists")
Collaborator

Consider adding more error details

Consider making the error message more descriptive by including task identifier details in adminErrors.NewFlyteAdminErrorf() call

Code suggestion
Check the AI-generated fix before applying
Suggested change
- return adminErrors.NewFlyteAdminErrorf(codes.AlreadyExists, "task already exists")
+ return adminErrors.NewFlyteAdminErrorf(codes.AlreadyExists, "task %v already exists", input.TaskKey)

Code Review Run #ec8f77



@eapolinario eapolinario merged commit 61838b4 into flyteorg:master Dec 27, 2024
52 checks passed
Development

Successfully merging this pull request may close these issues.

[Housekeeping] Flyteadmin digest comparison should rely on database semantics
4 participants